Lemmatization for variation-rich languages using deep learning

Authors

  • Mike Kestemont
  • Guy De Pauw
  • Renske van Nie
  • Walter Daelemans
Abstract

In this article, we describe a novel approach to sequence tagging for languages that are rich in (e.g. orthographic) surface variation. We focus on lemmatization, a basic step in many processing pipelines in the Digital Humanities. While this task has long been considered solved for modern languages such as English, there exist many (e.g. historic) languages for which the problem is harder to solve, due to a lack of resources and unstable orthography. Our approach is based on recent advances in the field of ‘deep’ representation learning, where neural networks have led to a dramatic increase in performance across several domains. The proposed system combines two approaches: first, we apply temporal convolutions to model the orthography of input words at the character level; second, we use distributional word embeddings to represent the lexical context surrounding the input words. We demonstrate how this system reaches state-of-the-art performance on a number of representative Middle Dutch data sets, even without corpus-specific parameter tuning.
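The two input representations named in the abstract can be illustrated with a small, untrained sketch in plain Python. It is not the authors' implementation: the filter width, the number of filters, the embedding size, the random weights, and the helper names (`one_hot`, `conv_max_pool`, `represent`) are all assumptions made for illustration. It only shows how character-level temporal convolution features for a focus token might be concatenated with embeddings of the neighbouring words before a classifier predicts the lemma.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def one_hot(word):
    """Encode a word as a sequence of one-hot character vectors."""
    return [[1.0 if ch == a else 0.0 for a in ALPHABET] for ch in word.lower()]

def conv_max_pool(char_vectors, filters, width):
    """Slide each filter over the characters (temporal convolution), apply ReLU,
    then keep the maximum activation per filter (max-over-time pooling)."""
    dim = len(ALPHABET)
    # Pad short words with zero vectors so that at least one window exists.
    padding = [[0.0] * dim for _ in range(max(0, width - len(char_vectors)))]
    padded = char_vectors + padding
    features = []
    for filt in filters:  # each filter is a width x dim weight matrix
        activations = []
        for start in range(len(padded) - width + 1):
            window = padded[start:start + width]
            total = sum(w * x for w_row, x_row in zip(filt, window)
                        for w, x in zip(w_row, x_row))
            activations.append(max(0.0, total))
        features.append(max(activations))
    return features

def represent(tokens, index, filters, width, embeddings, emb_dim):
    """Concatenate character features of the focus token with the embeddings
    of its left and right neighbours (zero vectors at sentence edges)."""
    zero = [0.0] * emb_dim
    char_part = conv_max_pool(one_hot(tokens[index]), filters, width)
    left = embeddings.get(tokens[index - 1], zero) if index > 0 else zero
    right = embeddings.get(tokens[index + 1], zero) if index + 1 < len(tokens) else zero
    return char_part + left + right

# Hypothetical toy setup: random, untrained weights and embeddings.
rng = random.Random(0)
WIDTH, N_FILTERS, EMB_DIM = 3, 4, 3
filters = [[[rng.uniform(-1, 1) for _ in ALPHABET] for _ in range(WIDTH)]
           for _ in range(N_FILTERS)]
tokens = ["die", "coninc", "sprac"]
embeddings = {w: [rng.uniform(-1, 1) for _ in range(EMB_DIM)] for w in tokens}

# Feature vector for the focus token "coninc" in its context.
vector = represent(tokens, 1, filters, WIDTH, embeddings, EMB_DIM)
```

In a trained system the filters and embeddings would be learned from annotated data, and `vector` would feed a classification layer over candidate lemmas; the sketch stops at the input representation.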

Similar resources

A Lemmatization Web Service Based on Machine Learning Techniques

Lemmatization is the process of finding the normalized form of words from surface word-forms as they appear in running text. It is a useful pre-processing step for many language engineering tasks and is especially important for languages with rich inflectional morphology. This paper presents two approaches to automated word lemmatization, which both use machine learning techniques to learn parti...

Simple Data-Driven Context-Sensitive Lemmatization

Lemmatization for languages with rich inflectional morphology is one of the basic, indispensable steps in a language processing pipeline. In this paper we present a simple data-driven context-sensitive approach to lemmatizing word forms in running text. We treat lemmatization as a classification task for Machine Learning, and automatically induce class labels. We achieve this by computing a S...

Learning Morphology with Morfette

Morfette is a modular, data-driven, probabilistic system which learns to perform joint morphological tagging and lemmatization from morphologically annotated corpora. The system is composed of two learning modules which are trained to predict morphological tags and lemmas using the Maximum Entropy classifier. The third module dynamically combines the predictions of the Maximum-Entropy models an...

Context Sensitive Lemmatization Using Two Successive Bidirectional Gated Recurrent Networks

We introduce a composite deep neural network architecture for supervised, language-independent, context-sensitive lemmatization. The proposed method treats the task as identifying the correct edit tree representing the transformation between a word-lemma pair. To find the lemma of a surface word, we exploit two successive bidirectional gated recurrent structures: the first one is used to ex...

Dealing with word-internal modification and spelling variation in data-driven lemmatization

This paper describes our contribution to two challenges in data-driven lemmatization. We approach lemmatization in the framework of a two-stage process, where first lemma candidates are generated and afterwards a ranker chooses the most probable lemma from these candidates. The first challenge is that languages with rich morphology like Modern German can feature morphological changes of differe...

Journal:
  • DSH

Volume 32, issue not listed

Pages not listed

Publication date: 2017